Here we will map an example dataset from Roy et al (Cell Rep 2019)
comprising 12,245 CD34+ cells from fetal and adult bone marrow. Note
that any human hematopoietic cells, regardless of tissue source or
disease status, can be mapped to this reference. This tutorial will take
approximately 10 minutes to run
Setup
library(Seurat)
library(tidyverse)
library(symphony)
library(ggpubr)
library(patchwork)
Install package from github
devtools::install_github('andygxzeng/BoneMarrowMap', force = TRUE)
library(BoneMarrowMap)
Download reference object and UMAP model
Set projection folder, Download reference object and UMAP model
# Set directory to store projection reference files
projection_path = './'
# Download Bone Marrow Reference - 344 Mb
curl::curl_download('https://bonemarrowmap.s3.us-east-2.amazonaws.com/BoneMarrow_RefMap_SymphonyRef.rds',
destfile = paste0(projection_path, 'BoneMarrow_RefMap_SymphonyRef.rds'))
# Download uwot model file - 221 Mb
curl::curl_download('https://bonemarrowmap.s3.us-east-2.amazonaws.com/BoneMarrow_RefMap_uwot_model.uwot',
destfile = paste0(projection_path, 'BoneMarrow_RefMap_uwot_model.uwot'))
# Load Symphony reference
ref <- readRDS(paste0(projection_path, 'BoneMarrow_RefMap_SymphonyRef.rds'))
# Set uwot path for UMAP projection
ref$save_uwot_path <- paste0(projection_path, 'BoneMarrow_RefMap_uwot_model.uwot')
Visualize Bone Marrow Reference
If we want to visualize celltype labels or metadata from the BM
Reference, we can create a Seurat Object from the symphony reference
This will be memory efficient as it will not include gene expression
counts, only the UMAP coordinates and the metadata including cell labels
and sorting information

We can visualize other annotations too, including cell cycle phase
and lineage pseudotime estimates.
p1 <- DimPlot(ReferenceSeuratObj, reduction = 'umap', group.by = 'CyclePhase', raster=FALSE)
p2 <- FeaturePlot(ReferenceSeuratObj, reduction = 'umap', features = 'Pseudotime', raster=FALSE)
p1 + p2

Load Query Seurat object
Example Query object is a subset of data from Roy et al (Cell Reports
2021), incorporating CD34p cells from an adult bone marrow sample and a
fetal bone marrow sample. Here, we prefer raw count data without
normalization.
Input does not need to be a seurat object, can load raw count matrix
and metadata separately
# Load seurat object
query <- readRDS(paste0(projection_path, 'ExampleQuery_Roy2021.rds'))
query
An object of class Seurat
33538 features across 12245 samples within 1 assay
Active assay: RNA (33538 features, 0 variable features)
Map the Query Data
Provide raw counts, metadata, and donor key. This should take <1
min Calculate mapping error and perform QC to remove low quality cells
with high mapping error
# batch variable to correct in the query data, set as NULL if no batches in query
batchvar <- 'sampleID'
# Map query dataset using Symphony (Kang et al 2021)
query <- map_Query(
exp_query = query@assays$RNA@counts,
metadata_query = query@meta.data,
ref_obj = ref,
vars = batchvar
)
Normalizing
Scaling and synchronizing query gene expression
Found 2386 reference variable genes in query dataset
Project query cells using reference gene loadings
Clustering query cells to reference centroids
Correcting query batch effects
UMAP
All done!
Warning: Invalid name supplied, making object name syntactically valid. New object name is Seurat..ProjectDim.RNA.harmony; see ?make.names for more details on syntax validity
Now that the data is mapped, we will evaluate the mapping QC metrics
and flag cells with high mapping errors
# Run QC based on mapping error score, flag cells with mapping error >= 2.5 MADs above median
query <- query %>% calculate_MappingError(., reference = ref, MAD.threshold = 2.5)
# Get QC Plots
QC_plots <- plot_MappingErrorQC(query)
# Plot together - If this is too crowded, can also just call "QC_plots" to display one by one
patchwork::wrap_plots(QC_plots, ncol = 4, widths = c(0.8, 0.3, 0.8, 0.3))

This important step identifies a subset of cells with high mapping
error from the query dataset that are either: 1) not present within the
reference, or 2) have poor QC metrics (low RNA counts and low
transcriptional diversity)
Sometimes, low quality cells may erroneously map to the
orthochromatic erythroblast region as this cell type has very low
transcriptional diversity. These low quality query cells do not have
hemoglobin expression and are in fact mis-mapped; they will be flagged
by the QC filter and excluded from cell type assignments.
Please adjust the MAD.threshold (typically between 1 and 3)
based on the distribution of your dataset to identify the outliers with
low quality and high mapping error scores. This will improve your
classifications and any downstream composition analysis
# # Optional step - remove outliers with high mapping error
# query <- subset(query, mapping_error_QC == 'Pass')
Optionally, outlier cells with high mapping error can also be removed
at this stage. For ease of integrating these mapped annotations with the
rest of your analysis, we also choose to skip this step. If so, Final
CellType and Pseudotime predictions will be assigned as NA for cells
failing the mapping error QC threshold.
Cell Type Assignments
We will next use a KNN classifier to assign cell identity based on
the 30 K-Nearest Neighbours from the reference map. This label transfer
step will take longer, potentially around 10 minutes for ~10,000
cells
# Predict Hematopoietic Cell Types by KNN classification
query <- predict_CellTypes(
query_obj = query,
ref_obj = ref,
initial_label = 'initial_CellType', # celltype assignments before mapping QC
final_label = 'predicted_CellType' # celltype assignments with map QC failing cells assigned as NA
)
DimPlot(subset(query, mapping_error_QC == 'Pass'), reduction = 'umap', group.by = c('predicted_CellType'), raster=FALSE, label=TRUE, label.size = 4)

Pseudotime Annotations
We can also annotate each query cell based on their position along
hematopoietic pseudotime. Query cells will be assigned a pseudotime
score based on the 30 K-Nearest Neighbours from the reference map. Since
our Pseudotime KNN assignments are performed in UMAP space (more
accurate than KNN on harmony components), this step is very fast (<
10s)
# Predict Pseudotime values by KNN
query <- predict_Pseudotime(
query_obj = query,
ref_obj = ref,
initial_label = 'initial_Pseudotime',
final_label = 'predicted_Pseudotime'
)
# Visualize Hematopoietic Pseudotime in query data
FeaturePlot(subset(query, mapping_error_QC == 'Pass'), features = c('predicted_Pseudotime'))

Save projection results
This will save a csv file with the mapped annotations for each cell
(mapping error scores, umap coordinates, predicted Cell Type, and
predicted Pseudotime)
# Save CellType Annotations and Projected UMAP coordinates
save_ProjectionResults(
query_obj = query,
celltype_label = 'predicted_CellType',
celltype_KNNprob_label = 'predicted_CellType_prob',
pseudotime_label = 'predicted_Pseudotime',
file_name = 'querydata_projected_labeled.csv')
Visualize Projection Density
Now let’s visualize the density distribution of query cells across
the hematopoietic hierarchy

We can also set Hierarchy_only = TRUE to remove T/NK/Plasma/Stromal
cells and focus solely on the hematopoietic hierarchy.

Note how cell composition differs between CD34p sorted Adult Bone
Marrow and Fetal Bone Marrow within the Roy 2021 dataset. Notably, Adult
bone marrow is enriched for more early HSC/MPP and LMPP cells and also
MEP & Early Erythroid lineages. In contrast, Fetal bone marrow is
highly enriched for MLP and B cell progenitors.
We can also compare the mapping of these samples along Hematopoietic
pseudotime:
Get Composition data for each donor
Here, to study the abundance of each cell type within each donor, I
focus on cells that were classified with a KNN prob > 0.5 (that is,
>50% of nearest neighbours from the reference map agree on the
assigned cell type).
We can present this as a long table
query_composition <- get_Composition(
query_obj = query,
donor_key = 'sampleID',
celltype_label = 'predicted_CellType',
mapQC_col = 'mapping_error_QC',
knn_prob_cutoff = NULL,
return_type = 'long')
query_composition
Or as a wide table with counts of # of cells
query_composition <- get_Composition(
query_obj = query,
donor_key = 'sampleID',
celltype_label = 'predicted_CellType',
mapQC_col = 'mapping_error_QC',
knn_prob_cutoff = NULL,
return_type = 'count')
query_composition
Or as a wide table with proportion of each cell type within each
donor
query_composition <- get_Composition(
query_obj = query,
donor_key = 'sampleID',
celltype_label = 'predicted_CellType',
mapQC_col = 'mapping_error_QC',
knn_prob_cutoff = NULL,
return_type = 'proportion')
query_composition
# Simple heatmap to visualize composition of projected samples
p <- query_composition %>% column_to_rownames('sampleID') %>% data.matrix() %>% ComplexHeatmap::Heatmap()
p

---
title: "Bone Marrow Reference Map Projection Tutorial"
output: html_notebook
---

Here we will map an example dataset from Roy et al (Cell Rep 2019) comprising 12,245 CD34+ cells from fetal and adult bone marrow.
Note that any human hematopoietic cells, regardless of tissue source or disease status, can be mapped to this reference.
This tutorial will take approximately 10 minutes to run

### Setup

```{r}
library(Seurat)
library(tidyverse)
library(symphony)
library(ggpubr)
library(patchwork)
```

Install package from github

```{r}
devtools::install_github('andygxzeng/BoneMarrowMap', force = TRUE)
library(BoneMarrowMap)
```


#### Download reference object and UMAP model

Set projection folder, Download reference object and UMAP model

```{r}
# Set directory to store projection reference files
projection_path = './'

# Download Bone Marrow Reference - 344 Mb
curl::curl_download('https://bonemarrowmap.s3.us-east-2.amazonaws.com/BoneMarrow_RefMap_SymphonyRef.rds', 
                    destfile = paste0(projection_path, 'BoneMarrow_RefMap_SymphonyRef.rds'))
# Download uwot model file - 221 Mb
curl::curl_download('https://bonemarrowmap.s3.us-east-2.amazonaws.com/BoneMarrow_RefMap_uwot_model.uwot', 
                    destfile = paste0(projection_path, 'BoneMarrow_RefMap_uwot_model.uwot'))

# Load Symphony reference
ref <- readRDS(paste0(projection_path, 'BoneMarrow_RefMap_SymphonyRef.rds'))
# Set uwot path for UMAP projection
ref$save_uwot_path <- paste0(projection_path, 'BoneMarrow_RefMap_uwot_model.uwot')
```


#### Visualize Bone Marrow Reference

If we want to visualize celltype labels or metadata from the BM Reference, we can create a Seurat Object from the symphony reference 
This will be memory efficient as it will not include gene expression counts, only the UMAP coordinates and the metadata including cell labels and sorting information

```{r, fig.height=5, fig.width=11}
ReferenceSeuratObj <- create_ReferenceObject(ref)

DimPlot(ReferenceSeuratObj, reduction = 'umap', group.by = 'CellType_Annotation_formatted', 
        raster=FALSE, label=TRUE, label.size = 4)
```

We can visualize other annotations too, including cell cycle phase and lineage pseudotime estimates.

```{r, fig.height=3.5, fig.width=11}
p1 <- DimPlot(ReferenceSeuratObj, reduction = 'umap', group.by = 'CyclePhase', raster=FALSE)
p2 <- FeaturePlot(ReferenceSeuratObj, reduction = 'umap', features = 'Pseudotime', raster=FALSE) 

p1 + p2
```


#### Load Query Seurat object
Example Query object is a subset of data from Roy et al (Cell Reports 2021), incorporating CD34p cells from an adult bone marrow sample and a fetal bone marrow sample. Here, we prefer raw count data without normalization.

Input does not need to be a seurat object, can load raw count matrix and metadata separately

```{r}
# Load example data from Roy et al - 141 Mb
curl::curl_download('https://bonemarrowmap.s3.us-east-2.amazonaws.com/ExampleQuery_Roy2021.rds',
                    destfile = paste0(projection_path, 'ExampleQuery_Roy2021.rds'))

# Load seurat object
query <- readRDS(paste0(projection_path, 'ExampleQuery_Roy2021.rds'))
query
```

### Map the Query Data
Provide raw counts, metadata, and donor key. This should take <1 min
Calculate mapping error and perform QC to remove low quality cells with high mapping error

```{r}
# batch variable to correct in the query data, set as NULL if no batches in query
batchvar <- 'sampleID'

# Map query dataset using Symphony (Kang et al 2021)
query <- map_Query(
    exp_query = query@assays$RNA@counts, 
    metadata_query = query@meta.data,
    ref_obj = ref,
    vars = batchvar
)
```

Now that the data is mapped, we will evaluate the mapping QC metrics and flag cells with high mapping errors

```{r, fig.height=3, fig.width=10}
# Run QC based on mapping error score, flag cells with mapping error >= 2.5 MADs above median
query <- query %>% calculate_MappingError(., reference = ref, MAD.threshold = 2.5) 

# Get QC Plots
QC_plots <- plot_MappingErrorQC(query)

# Plot together - If this is too crowded, can also just call "QC_plots" aloneto display one by one
patchwork::wrap_plots(QC_plots, ncol = 4, widths = c(0.8, 0.3, 0.8, 0.3))
```


This important step identifies a subset of cells with high mapping error from the query dataset that are either:
  1) not present within the reference, or
  2) have poor QC metrics (low RNA counts and low transcriptional diversity)

Sometimes, low quality cells may erroneously map to the orthochromatic erythroblast region as this cell type has very low transcriptional diversity. 
These low quality query cells do not have hemoglobin expression and are in fact mis-mapped; they will be flagged by the QC filter and excluded from cell type assignments.

**Please adjust the MAD.threshold (typically between 1 and 3) based on the distribution of your dataset to identify the outliers with low quality and high mapping error scores. This will improve your classifications and any downstream composition analysis**

```{r}
# # Optional step - remove outliers with high mapping error
# query <- subset(query, mapping_error_QC == 'Pass')
```

Optionally, outlier cells with high mapping error can also be removed at this stage.
For ease of integrating these mapped annotations with the rest of your analysis, we also choose to skip this step. If so, Final CellType and Pseudotime predictions will be assigned as NA for cells failing the mapping error QC threshold. 


### Cell Type Assignments
We will next use a KNN classifier to assign cell identity based on the 30 K-Nearest Neighbours from the reference map.
This label transfer step will take longer, potentially around 10 minutes for ~10,000 cells 

```{r, fig.height=5, fig.width=10}
# Predict Hematopoietic Cell Types by KNN classification
query <- predict_CellTypes(
  query_obj = query, 
  ref_obj = ref, 
  initial_label = 'initial_CellType', # celltype assignments before mapping QC
  final_label = 'predicted_CellType'  # celltype assignments with map QC failing cells assigned as NA
) 

DimPlot(subset(query, mapping_error_QC == 'Pass'), reduction = 'umap', group.by = c('predicted_CellType'), raster=FALSE, label=TRUE, label.size = 4)
```


#### Pseudotime Annotations
We can also annotate each query cell based on their position along hematopoietic pseudotime. 
Query cells will be assigned a pseudotime score based on the 30 K-Nearest Neighbours from the reference map.
Since our Pseudotime KNN assignments are performed in UMAP space (more accurate than KNN on harmony components), this step is very fast (< 10s)

```{r}
# Predict Pseudotime values by KNN
query <- predict_Pseudotime(
  query_obj = query, 
  ref_obj = ref, 
  initial_label = 'initial_Pseudotime', 
  final_label = 'predicted_Pseudotime'
)

# Visualize Hematopoietic Pseudotime in query data
FeaturePlot(subset(query, mapping_error_QC == 'Pass'), features = c('predicted_Pseudotime'))
```


### Save projection results

This will save a csv file with the mapped annotations for each cell (mapping error scores, umap coordinates, predicted Cell Type, and predicted Pseudotime)

```{r}
# Save CellType Annotations and Projected UMAP coordinates
save_ProjectionResults(
  query_obj = query, 
  celltype_label = 'predicted_CellType', 
  celltype_KNNprob_label = 'predicted_CellType_prob', 
  pseudotime_label = 'predicted_Pseudotime', 
  file_name = 'querydata_projected_labeled.csv')
```


### Visualize Projection Density

Now let's visualize the density distribution of query cells across the hematopoietic hierarchy

```{r, fig.height=3, fig.width=9}
# Set batch/condition to be visualized individually
batch_key <- 'sampleID'

# returns a list of plots for each donor from a pre-specified batch variable
projection_plots <- plot_Projection_byDonor(
  query_obj = query, 
  batch_key = batch_key, 
  ref_obj = ref, 
  Hierarchy_only = FALSE, # Whether to exclude T/NK/Plasma/Stromal cells 
  downsample_reference = TRUE, 
  downsample_frac = 0.25,   # down-sample reference cells to 25%; reduces figure file size
  query_point_size = 0.2,   # adjust size of query cells based on # of cells
  saveplot = TRUE, 
  save_folder = 'projectionFigures/'
)

# show plots together with patchwork. Can also just call "projection_plots" object to display one-by-one
patchwork::wrap_plots(projection_plots, ncol = 2)
```

We can also set Hierarchy_only = TRUE to remove T/NK/Plasma/Stromal cells and focus solely on the hematopoietic hierarchy.

```{r, fig.height=3, fig.width=8}
# Set batch/condition to be visualized individually
batch_key <- 'sampleID'

# returns a list of plots for each donor from a pre-specified batch variable
projection_plots <- plot_Projection_byDonor(
  query_obj = query, 
  batch_key = batch_key, 
  ref_obj = ref, 
  Hierarchy_only = TRUE, # Whether to exclude T/NK/Plasma/Stromal cells 
  downsample_reference = TRUE, 
  downsample_frac = 0.25,   # down-sample reference cells to 25%; reduces figure file size
  query_point_size = 0.2,   # adjust size of query cells based on # of cells
  saveplot = TRUE, 
  save_folder = 'projectionFigures/'
)

# show plots together with patchwork. Can also just call "projection_plots" object to display one-by-one
patchwork::wrap_plots(projection_plots, ncol = 2)
```

Note how cell composition differs between CD34p sorted Adult Bone Marrow and Fetal Bone Marrow within the Roy 2021 dataset. 
Notably, Adult bone marrow is enriched for more early HSC/MPP and LMPP cells and also MEP & Early Erythroid lineages. 
In contrast, Fetal bone marrow is highly enriched for MLP and B cell progenitors. 

We can also compare the mapping of these samples along Hematopoietic pseudotime:


### Get Composition data for each donor

Here, to study the abundance of each cell type within each donor, I focus on cells that were classified with a KNN prob > 0.5 (that is, >50% of nearest neighbours from the reference map agree on the assigned cell type). 

We can present this as a long table

```{r}
query_composition <- get_Composition(
  query_obj = query, 
  donor_key = 'sampleID', 
  celltype_label = 'predicted_CellType', 
  mapQC_col = 'mapping_error_QC', 
  knn_prob_cutoff = NULL, 
  return_type = 'long')

query_composition 
```

Or as a wide table with counts of # of cells

```{r}
query_composition <- get_Composition(
  query_obj = query, 
  donor_key = 'sampleID', 
  celltype_label = 'predicted_CellType', 
  mapQC_col = 'mapping_error_QC', 
  knn_prob_cutoff = NULL, 
  return_type = 'count')

query_composition 
```

Or as a wide table with proportion of each cell type within each donor

```{r}
query_composition <- get_Composition(
  query_obj = query, 
  donor_key = 'sampleID', 
  celltype_label = 'predicted_CellType', 
  mapQC_col = 'mapping_error_QC', 
  knn_prob_cutoff = NULL, 
  return_type = 'proportion')

query_composition 
```


```{r}
# Simple heatmap to visualize composition of projected samples
p <- query_composition %>% column_to_rownames('sampleID') %>% data.matrix() %>% ComplexHeatmap::Heatmap()
p
```
